{
 "metadata": {
  "name": "",
  "signature": "sha256:2392bedff6d89f307fd43ddf2fa2d9bd14dc2376d7e2603aaee29fe2ca1d63d3"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<img src=http://continuum.io/media/img/continuum_analytics_logo.png align=\"right\" width=\"30%\">\n",
      "\n",
      "# Blaze - Large collections of large CSV files\n",
      "\n",
      "Large CSV files, or large collections of many CSV files are a common-yet-cumbersome data format.  If the data in these files doesn't fit into memory we're usually forced to \n",
      "\n",
      "1.  Ingest the data into a database\n",
      "2.  Perform gymnastics while using Pandas\n",
      "\n",
      "This notebook demonstrates that Blaze operates smoothly on such data.  It then shows exactly how Blaze uses Pandas by effectively automating the gynmastics in step 2.  We perform an out-of-core split-apply-combine operation on the NYC Taxicab dataset while using a comfortably small amount of space.\n",
      "\n",
      "All computations in this notebook were done on a personal laptop with smallish memory using a recent version of blaze\n",
      "\n",
      "    conda install -c blaze blaze\n",
      "    \n",
      "    or \n",
      "    \n",
      "    pip install blaze"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Download Data\n",
      "\n",
      "Note, this is a lot of data to download.  It's also a lot of data to serve.  You might consider grabbing this from a torrent instead."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# !wget https://nyctaxitrips.blob.core.windows.net/data/trip_data_{1,2,3,4,5,6,7,8,9,10,11,12}.csv.zip"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Inspect with Blaze\n",
      "\n",
      "We design the `blaze.Data` constructor to be easy to use.  Here we give it a globstring of the files we want to analyze.\n",
      "\n",
      "It gives us a quick `head` of the data immediately, even though there are several gigabytes of data.  If you're unfamiliar with the data you may want to quickly peruse the columns and values."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from blaze import *\n",
      "\n",
      "d = Data('trip_data_*.csv')\n",
      "d"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>medallion</th>\n",
        "      <th>hack_license</th>\n",
        "      <th>vendor_id</th>\n",
        "      <th>rate_code</th>\n",
        "      <th>store_and_fwd_flag</th>\n",
        "      <th>pickup_datetime</th>\n",
        "      <th>dropoff_datetime</th>\n",
        "      <th>passenger_count</th>\n",
        "      <th>trip_time_in_secs</th>\n",
        "      <th>trip_distance</th>\n",
        "      <th>pickup_longitude</th>\n",
        "      <th>pickup_latitude</th>\n",
        "      <th>dropoff_longitude</th>\n",
        "      <th>dropoff_latitude</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0 </th>\n",
        "      <td> 89D227B655E5C82AECF13C3F540D4CF4</td>\n",
        "      <td> BA96DE419E711691B9445D6A6307C170</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-01 15:11:48</td>\n",
        "      <td>2013-01-01 15:18:10</td>\n",
        "      <td> 4</td>\n",
        "      <td>  382</td>\n",
        "      <td>  1.0</td>\n",
        "      <td>-73.978165</td>\n",
        "      <td> 40.757977</td>\n",
        "      <td>-73.989838</td>\n",
        "      <td> 40.751171</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1 </th>\n",
        "      <td> 0BD7C8F5BA12B88E0B67BED28BEA73D8</td>\n",
        "      <td> 9FD8F69F0804BDB5549F40E9DA1BE472</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-06 00:18:35</td>\n",
        "      <td>2013-01-06 00:22:54</td>\n",
        "      <td> 1</td>\n",
        "      <td>  259</td>\n",
        "      <td>  1.5</td>\n",
        "      <td>-74.006683</td>\n",
        "      <td> 40.731781</td>\n",
        "      <td>-73.994499</td>\n",
        "      <td> 40.750660</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2 </th>\n",
        "      <td> 0BD7C8F5BA12B88E0B67BED28BEA73D8</td>\n",
        "      <td> 9FD8F69F0804BDB5549F40E9DA1BE472</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-05 18:49:41</td>\n",
        "      <td>2013-01-05 18:54:23</td>\n",
        "      <td> 1</td>\n",
        "      <td>  282</td>\n",
        "      <td>  1.1</td>\n",
        "      <td>-74.004707</td>\n",
        "      <td> 40.737770</td>\n",
        "      <td>-74.009834</td>\n",
        "      <td> 40.726002</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3 </th>\n",
        "      <td> DFD2202EE08F7A8DC9A57B02ACB81FE2</td>\n",
        "      <td> 51EE87E3205C985EF8431D850C786310</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-07 23:54:15</td>\n",
        "      <td>2013-01-07 23:58:20</td>\n",
        "      <td> 2</td>\n",
        "      <td>  244</td>\n",
        "      <td>  0.7</td>\n",
        "      <td>-73.974602</td>\n",
        "      <td> 40.759945</td>\n",
        "      <td>-73.984734</td>\n",
        "      <td> 40.759388</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4 </th>\n",
        "      <td> DFD2202EE08F7A8DC9A57B02ACB81FE2</td>\n",
        "      <td> 51EE87E3205C985EF8431D850C786310</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-07 23:25:03</td>\n",
        "      <td>2013-01-07 23:34:24</td>\n",
        "      <td> 1</td>\n",
        "      <td>  560</td>\n",
        "      <td>  2.1</td>\n",
        "      <td>-73.976250</td>\n",
        "      <td> 40.748528</td>\n",
        "      <td>-74.002586</td>\n",
        "      <td> 40.747868</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5 </th>\n",
        "      <td> 20D9ECB2CA0767CF7A01564DF2844A3E</td>\n",
        "      <td> 598CCE5B9C1918568DEE71F43CF26CD2</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-07 15:27:48</td>\n",
        "      <td>2013-01-07 15:38:37</td>\n",
        "      <td> 1</td>\n",
        "      <td>  648</td>\n",
        "      <td>  1.7</td>\n",
        "      <td>-73.966743</td>\n",
        "      <td> 40.764252</td>\n",
        "      <td>-73.983322</td>\n",
        "      <td> 40.743763</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>6 </th>\n",
        "      <td> 496644932DF3932605C22C7926FF0FE0</td>\n",
        "      <td> 513189AD756FF14FE670D10B92FAF04C</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-08 11:01:15</td>\n",
        "      <td>2013-01-08 11:08:14</td>\n",
        "      <td> 1</td>\n",
        "      <td>  418</td>\n",
        "      <td>  0.8</td>\n",
        "      <td>-73.995804</td>\n",
        "      <td> 40.743977</td>\n",
        "      <td>-74.007416</td>\n",
        "      <td> 40.744343</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>7 </th>\n",
        "      <td> 0B57B9633A2FECD3D3B1944AFC7471CF</td>\n",
        "      <td> CCD4367B417ED6634D986F573A552A62</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-07 12:39:18</td>\n",
        "      <td>2013-01-07 13:10:56</td>\n",
        "      <td> 3</td>\n",
        "      <td> 1898</td>\n",
        "      <td> 10.7</td>\n",
        "      <td>-73.989937</td>\n",
        "      <td> 40.756775</td>\n",
        "      <td>-73.865250</td>\n",
        "      <td> 40.770630</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8 </th>\n",
        "      <td> 2C0E91FF20A856C891483ED63589F982</td>\n",
        "      <td> 1DA2F6543A62B8ED934771661A9D2FA0</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-07 18:15:47</td>\n",
        "      <td>2013-01-07 18:20:47</td>\n",
        "      <td> 1</td>\n",
        "      <td>  299</td>\n",
        "      <td>  0.8</td>\n",
        "      <td>-73.980072</td>\n",
        "      <td> 40.743137</td>\n",
        "      <td>-73.982712</td>\n",
        "      <td> 40.735336</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>9 </th>\n",
        "      <td> 2D4B95E2FA7B2E85118EC5CA4570FA58</td>\n",
        "      <td> CD2F522EEE1FF5F5A8D8B679E23576B3</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-07 15:33:28</td>\n",
        "      <td>2013-01-07 15:49:26</td>\n",
        "      <td> 2</td>\n",
        "      <td>  957</td>\n",
        "      <td>  2.5</td>\n",
        "      <td>-73.977936</td>\n",
        "      <td> 40.786983</td>\n",
        "      <td>-73.952919</td>\n",
        "      <td> 40.806370</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10</th>\n",
        "      <td> E12F6AF991172EAC3553144A0AF75A19</td>\n",
        "      <td> 06918214E951FA0003D1CC54955C2AB0</td>\n",
        "      <td> CMT</td>\n",
        "      <td> 1</td>\n",
        "      <td> N</td>\n",
        "      <td>2013-01-08 13:11:52</td>\n",
        "      <td>2013-01-08 13:19:50</td>\n",
        "      <td> 1</td>\n",
        "      <td>  477</td>\n",
        "      <td>  1.3</td>\n",
        "      <td>-73.982452</td>\n",
        "      <td> 40.773167</td>\n",
        "      <td>-73.964134</td>\n",
        "      <td> 40.773815</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "                           medallion                      hack_license  \\\n",
        "0   89D227B655E5C82AECF13C3F540D4CF4  BA96DE419E711691B9445D6A6307C170   \n",
        "1   0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472   \n",
        "2   0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472   \n",
        "3   DFD2202EE08F7A8DC9A57B02ACB81FE2  51EE87E3205C985EF8431D850C786310   \n",
        "4   DFD2202EE08F7A8DC9A57B02ACB81FE2  51EE87E3205C985EF8431D850C786310   \n",
        "5   20D9ECB2CA0767CF7A01564DF2844A3E  598CCE5B9C1918568DEE71F43CF26CD2   \n",
        "6   496644932DF3932605C22C7926FF0FE0  513189AD756FF14FE670D10B92FAF04C   \n",
        "7   0B57B9633A2FECD3D3B1944AFC7471CF  CCD4367B417ED6634D986F573A552A62   \n",
        "8   2C0E91FF20A856C891483ED63589F982  1DA2F6543A62B8ED934771661A9D2FA0   \n",
        "9   2D4B95E2FA7B2E85118EC5CA4570FA58  CD2F522EEE1FF5F5A8D8B679E23576B3   \n",
        "10  E12F6AF991172EAC3553144A0AF75A19  06918214E951FA0003D1CC54955C2AB0   \n",
        "\n",
        "   vendor_id  rate_code store_and_fwd_flag     pickup_datetime  \\\n",
        "0        CMT          1                  N 2013-01-01 15:11:48   \n",
        "1        CMT          1                  N 2013-01-06 00:18:35   \n",
        "2        CMT          1                  N 2013-01-05 18:49:41   \n",
        "3        CMT          1                  N 2013-01-07 23:54:15   \n",
        "4        CMT          1                  N 2013-01-07 23:25:03   \n",
        "5        CMT          1                  N 2013-01-07 15:27:48   \n",
        "6        CMT          1                  N 2013-01-08 11:01:15   \n",
        "7        CMT          1                  N 2013-01-07 12:39:18   \n",
        "8        CMT          1                  N 2013-01-07 18:15:47   \n",
        "9        CMT          1                  N 2013-01-07 15:33:28   \n",
        "10       CMT          1                  N 2013-01-08 13:11:52   \n",
        "\n",
        "      dropoff_datetime  passenger_count  trip_time_in_secs  trip_distance  \\\n",
        "0  2013-01-01 15:18:10                4                382            1.0   \n",
        "1  2013-01-06 00:22:54                1                259            1.5   \n",
        "2  2013-01-05 18:54:23                1                282            1.1   \n",
        "3  2013-01-07 23:58:20                2                244            0.7   \n",
        "4  2013-01-07 23:34:24                1                560            2.1   \n",
        "5  2013-01-07 15:38:37                1                648            1.7   \n",
        "6  2013-01-08 11:08:14                1                418            0.8   \n",
        "7  2013-01-07 13:10:56                3               1898           10.7   \n",
        "8  2013-01-07 18:20:47                1                299            0.8   \n",
        "9  2013-01-07 15:49:26                2                957            2.5   \n",
        "10 2013-01-08 13:19:50                1                477            1.3   \n",
        "\n",
        "    pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  \n",
        "0         -73.978165        40.757977         -73.989838         40.751171  \n",
        "1         -74.006683        40.731781         -73.994499         40.750660  \n",
        "2         -74.004707        40.737770         -74.009834         40.726002  \n",
        "3         -73.974602        40.759945         -73.984734         40.759388  \n",
        "4         -73.976250        40.748528         -74.002586         40.747868  \n",
        "5         -73.966743        40.764252         -73.983322         40.743763  \n",
        "6         -73.995804        40.743977         -74.007416         40.744343  \n",
        "7         -73.989937        40.756775         -73.865250         40.770630  \n",
        "8         -73.980072        40.743137         -73.982712         40.735336  \n",
        "9         -73.977936        40.786983         -73.952919         40.806370  \n",
        "..."
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Our Model Computation\n",
      "\n",
      "We now compute the average distance and number of rides grouped by the number of passengers riding in the cab.\n",
      "\n",
      "Looks like single passenger trips are the most common while three passenger trips are surprisingly long distance."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "by(d.passenger_count, avg_distance=d.trip_distance.mean(), \n",
      "                             count=d.passenger_count.count())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>passenger_count</th>\n",
        "      <th>avg_distance</th>\n",
        "      <th>count</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0 </th>\n",
        "      <td>   0</td>\n",
        "      <td>  0.833625</td>\n",
        "      <td>      5035</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1 </th>\n",
        "      <td>   1</td>\n",
        "      <td>  9.033823</td>\n",
        "      <td> 121959711</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2 </th>\n",
        "      <td>   2</td>\n",
        "      <td>  6.992290</td>\n",
        "      <td>  23517494</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3 </th>\n",
        "      <td>   3</td>\n",
        "      <td> 14.089989</td>\n",
        "      <td>   7315829</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4 </th>\n",
        "      <td>   4</td>\n",
        "      <td>  5.286269</td>\n",
        "      <td>   3582103</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5 </th>\n",
        "      <td>   5</td>\n",
        "      <td>  2.995215</td>\n",
        "      <td>  10034696</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>6 </th>\n",
        "      <td>   6</td>\n",
        "      <td>  2.956734</td>\n",
        "      <td>   6764789</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>7 </th>\n",
        "      <td>   7</td>\n",
        "      <td>  2.214286</td>\n",
        "      <td>        35</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8 </th>\n",
        "      <td>   8</td>\n",
        "      <td>  3.360400</td>\n",
        "      <td>        25</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>9 </th>\n",
        "      <td>   9</td>\n",
        "      <td>  2.226538</td>\n",
        "      <td>        26</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10</th>\n",
        "      <td> 129</td>\n",
        "      <td>  0.920000</td>\n",
        "      <td>         1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "    passenger_count  avg_distance      count\n",
        "0                 0      0.833625       5035\n",
        "1                 1      9.033823  121959711\n",
        "2                 2      6.992290   23517494\n",
        "3                 3     14.089989    7315829\n",
        "4                 4      5.286269    3582103\n",
        "5                 5      2.995215   10034696\n",
        "6                 6      2.956734    6764789\n",
        "7                 7      2.214286         35\n",
        "8                 8      3.360400         25\n",
        "9                 9      2.226538         26\n",
        "..."
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It's useful to note here all the things we *didn't do*.\n",
      "\n",
      "1.  We didn't worry about running out of memory\n",
      "2.  We didn't worry about calling `pandas.read_csv` with the right arguments to make this fast\n",
      "3.  We didn't worry about handling each CSV file separately\n",
      "\n",
      "Blaze handles these things for us.  It drives Pandas intelligently, breaks up our computation into pieces it can perform in memory, and then shoves data through Pandas as fast as it can."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## How fast is fast?\n",
      "\n",
      "How long did it take to process 16 GB and do an out-of-core split-apply-combine operation?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(), \n",
      "                                    count=d.passenger_count.count())\n",
      "\n",
      "%time _ = compute(expr)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 2min 13s, sys: 13.6 s, total: 2min 26s\n",
        "Wall time: 3min 7s\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Not great, but not terrible.  This is about as fast as Postgres does it *after* the data has been loaded into Postgres."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## How much memory did we need?\n",
      "\n",
      "Not very much, A few hundred megabytes by default.  \n",
      "\n",
      "Anecdotally I used to switch to a big machine when I ran out of memory.  Now I switch to my big machine when I run out of disk space."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Multiprocessing\n",
      "\n",
      "Blaze can use many cores to accelerate this work.  It still uses pandas in each core but it now just splits apart the computation intelligently and directs different CSV files to different cores.\n",
      "\n",
      "This gives a significant speedup even on my laptop.  On a large workstation with more cores this speedup is more pronounced."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import multiprocessing\n",
      "pool = multiprocessing.Pool(4)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%time _ = compute(expr, map=pool.map)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 133 ms, sys: 28.8 ms, total: 161 ms\n",
        "Wall time: 1min 1s\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## What does Blaze do internally?\n",
      "\n",
      "Lets say we didn't want to use Blaze but preferred to just use Pandas and some elbow grease.\n",
      "\n",
      "How can we use Pandas in-memory to handle an out-of-memory dataset?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Simple Example\n",
      "\n",
      "As an easier example, lets compute the mean of a single column on a single CSV file.  We'll use Pandas' chunked CSV reader.  \n",
      "\n",
      "To compute an out-of-core mean we'll compute a running total and running count for each chunk"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "totals = []\n",
      "counts = []\n",
      "\n",
      "for chunk in pd.read_csv('trip_data_1.csv', chunksize=1000000, usecols=['passenger_count']):\n",
      "    totals.append(chunk.passenger_count.sum())\n",
      "    counts.append(chunk.passenger_count.count())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then we perform a second computation on these intermediate results"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "1.0 * sum(totals) / sum(counts)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "1.6973720977368634"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is exactly what Blaze does when we type the following."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Data('trip_data_1.csv').passenger_count.mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "1.6973720977368634"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": [
        "1.6973720977368634"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### We break computations into pieces\n",
      "\n",
      "So to perform one computation, `mean`, on an out of core dataset, we end up performing two different sets of computations\n",
      "\n",
      "1.  `sum` and `count` on each in-memory chunk\n",
      "2.  `sum` and `division` on the aggregated results from step #1\n",
      "\n",
      "This breakdown works on a surprisingly large class of operations.  Split-apply-combine operations are handled similarly.  We perform a different split-apply-combine operation on each chunk and then another on the aggregated results.\n",
      "\n",
      "For more information on this see Blaze's [out-of-core docs](http://blaze.pydata.org/docs/dev/ooc.html#complex-example)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### When doesn't this work?\n",
      "\n",
      "This doesn't work on cases like `sort` or `join` nor on any computation for which the intermediate results don't fit in memory.  Of course, you can still `sort` or `join` computations, just so long as some data reducing step comes first."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Complex Example\n",
      "\n",
      "OK, so lets go through and solve the entire out-of-core split-apply-combine problem on all of the CSV files.\n",
      "\n",
      "Feel free to ignore this example.  It's mostly here to show explicitly exactly what Blaze does for those who care and to generally impress those who don't.\n",
      "\n",
      "Hold on to your butts."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%time\n",
      "from glob import glob\n",
      "\n",
      "# Specifying active columns at parse time greatly improves performance\n",
      "active_columns = ['passenger_count', 'trip_distance']\n",
      "intermediates = []\n",
      "\n",
      "# Do a split-apply-combine operation on each chunk of each CSV file\n",
      "for fn in sorted(glob('trip_data_*.csv')):\n",
      "    for df in pd.read_csv(fn, usecols=active_columns, \n",
      "                          chunksize=1000000, skipinitialspace=True):\n",
      "        chunk = df.groupby('passenger_count').agg({'passenger_count': ['count'],\n",
      "                                                   'trip_distance': ['sum', 'count']})\n",
      "        intermediates.append(chunk)\n",
      "\n",
      "# Bring those results together.  These are much smaller and so likely fit in memory\n",
      "df = pd.concat(intermediates, axis=0)\n",
      "df.columns = ['trip_distance_sum', 'trip_distance_count', 'passenger_count_count']  # Flatten multi-index\n",
      "\n",
      "# Perform second split-apply-combine operation on those intermediate results\n",
      "groups = df.groupby(df.index)  # group once for many of the following applies\n",
      "df2 = pd.concat([groups.trip_distance_sum.sum(),\n",
      "                 groups.trip_distance_count.sum(),\n",
      "                 groups.passenger_count_count.sum()],\n",
      "                axis=1)\n",
      "\n",
      "df2['avg_distance'] = df2.trip_distance_sum / df2.trip_distance_count\n",
      "df2['count'] = df2.passenger_count_count\n",
      "\n",
      "# Select out the columns we want\n",
      "result = df2[['avg_distance', 'count']]\n",
      "result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 2min, sys: 8.74 s, total: 2min 9s\n",
        "Wall time: 2min 49s\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Conclusion\n",
      "\n",
      "*Blaze is a general library to bring expert data analysis into the hands of everyday users*\n",
      "\n",
      "The example above is a PITA to do by hand.  More than that it has a number of tricks not known to many Pandas users. \n",
      "\n",
      "Fortunately Blaze automates these tricks, making them routine for a broad class of problems.  Moreso it does this from a relatively naive user-focused syntax.\n",
      "\n",
      "```Python\n",
      "d = Data('trip_data_*.csv')\n",
      "by(d.passenger_count, avg_distance=d.trip_distance.mean(), \n",
      "                             count=d.passenger_count.count())\n",
      "```\n",
      "\n",
      "Hopefully this example helps to explain how Blaze chunks apart computations on large CSV files to operate in memory.\n",
      "\n",
      "This also highlights the relationship between Blaze and Pandas.  Pandas is Blaze's preferred library when it performs in-memory analytics on tabular data.  In these cases it's Blaze's job to arrange data well and call Pandas with the right arguments while it's Pandas' job to actually do the computation.\n",
      "\n",
      "As a reminder, large CSV files are just one application of Blaze.  Blaze provides a similar experience and set-of-tricks for SQL, Spark, and Binary storage files.\n",
      "\n",
      "You can learn more about Blaze at http://blaze.pydata.org/"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}